From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.
Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.
From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.
We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The most up-voted scripts from this competition will receive official Kaggle swag as prizes.
Submissions are evaluated using the multi-class logarithmic loss. Each incident has been labeled with one true class. For each incident, you must submit a set of predicted probabilities (one for every class). The formula is then,
$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}),$$

where $N$ is the number of cases in the test set, $M$ is the number of class labels, $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ is in class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.

The submitted probabilities for a given incident are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, each predicted probability $p$ is replaced with $\max(\min(p, 1-10^{-15}), 10^{-15})$.
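As an illustration of how this scoring behaves, here is a minimal sketch of the metric in NumPy. The function name and arguments are hypothetical, and the rescale-then-clip order follows the description above rather than Kaggle's actual implementation:

import numpy as np

def multiclass_logloss(y_true, probs):
    # y_true: (N,) integer class indices; probs: (N, M) predicted probabilities
    probs = np.asarray(probs, dtype=float)
    # Rescale each row to sum to one, since submissions are not required to
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Clip to avoid log(0): p -> max(min(p, 1 - 1e-15), 1e-15)
    probs = np.clip(probs, 1e-15, 1 - 1e-15)
    # Average the negative log of the probability assigned to the true class
    rows = np.arange(len(y_true))
    return -np.mean(np.log(probs[rows, y_true]))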
You must submit a csv file with the incident id, all candidate class names, and a probability for each class. The order of the rows does not matter. The file must have a header and should look like the following:
Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,EXTORTION,FAMILY OFFENSES,FORGERY/COUNTERFEITING,FRAUD,GAMBLING,KIDNAPPING,LARCENY/THEFT,LIQUOR LAWS,LOITERING,MISSING PERSON,NON-CRIMINAL,OTHER OFFENSES,PORNOGRAPHY/OBSCENE MAT,PROSTITUTION,RECOVERED VEHICLE,ROBBERY,RUNAWAY,SECONDARY CODES,SEX OFFENSES FORCIBLE,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0.9,0.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
... etc.
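As a sketch, a uniform-probability baseline in this format could be assembled with pandas as follows; `classes` and `test_ids` are hypothetical stand-ins for the 39 category names above and the Ids from test.csv:

import pandas as pd

# Hypothetical stand-ins: in practice, use all 39 category names from the
# header above and the Id column of test.csv.
classes = ['ARSON', 'ASSAULT', 'BAD CHECKS']  # ...and the remaining 36
test_ids = [0, 1, 2]

# Every class gets the same probability; rows are rescaled at scoring time
# anyway, so a uniform row is a valid (if weak) baseline.
submission = pd.DataFrame(1.0 / len(classes), index=test_ids, columns=classes)
submission.index.name = 'Id'
submission.to_csv('submission.csv')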
File Name | Available Formats |
---|---|
test.csv | .zip (18.75 mb) |
sampleSubmission.csv | .zip (2.38 mb) |
train.csv | .zip (22.09 mb) |
This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week: odd-numbered weeks (1, 3, 5, 7, ...) belong to the test set, and even-numbered weeks (2, 4, 6, 8, ...) belong to the training set.
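As a rough sanity check of that rotation (a sketch only: it assumes weeks are counted from 1/1/2003, which the description does not spell out), one could inspect the week-number parity once train.csv has been loaded as train_data, as in the notebook below:

import pandas as pd

# Assumed convention: week 1 starts on 1/1/2003. If the rotation holds,
# training incidents should fall almost entirely in even-numbered weeks.
dates = pd.to_datetime(train_data.Dates)
week_number = (dates - pd.Timestamp('2003-01-01')).dt.days // 7 + 1
print(week_number.mod(2).value_counts())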
Dates - timestamp of the crime incident
Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
Descript - detailed description of the crime incident (only in train.csv)
DayOfWeek - the day of the week
PdDistrict - name of the Police Department District
Resolution - how the crime incident was resolved (only in train.csv)
Address - the approximate street address of the crime incident
X - Longitude
Y - Latitude
In [1]:
import pandas as pd
import zipfile
# Read the training set straight from the zip archive
archive = zipfile.ZipFile("C:/Users/vutran/Desktop/github/kaggle/San Francisco Crime Classification/data/train.csv.zip", 'r')
train_data = pd.read_csv(archive.open("train.csv"))
train_data.head()
Out[1]:
In [2]:
train_data.tail()
Out[2]:
In [3]:
train_data.dtypes
Out[3]:
In [4]:
train_data.info()
In [11]:
pd.unique(train_data.Category)
Out[11]:
In [12]:
pd.unique(train_data.Category).shape
Out[12]:
Now that we have a general idea of the dataset, we next clean and transform the data to create useful features for machine learning.
The Dates feature contains both a date and a time of day; I shall use only the hour.
In [51]:
# Extract the hour of day from the Dates timestamp
feature_hour = pd.to_datetime(train_data.Dates).dt.hour
pd.unique(feature_hour)
Out[51]:
In [ ]:
dow = {
    'Monday': 0,
    'Tuesday': 1,
    'Wednesday': 2,
    'Thursday': 3,
    'Friday': 4,
    'Saturday': 5,
    'Sunday': 6,
}
# Map the day names onto integer codes on the training dataframe
train_data['dayofweek'] = train_data.DayOfWeek.map(dow)
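Equivalently, pandas can derive the same codes straight from the timestamp, since its dt.dayofweek convention is also Monday=0 through Sunday=6:

In [ ]:
# Same mapping without a manual dictionary: dt.dayofweek yields
# Monday=0 ... Sunday=6, matching the dow dict above.
train_data['dayofweek'] = pd.to_datetime(train_data.Dates).dt.dayofweek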